
Pierre Mulliez

Introduction

‘As part of the course “Programming in R” at IE - HST, we put our newly learned skills into practice by participating in a Kaggle competition. The problem requires us to do Exploratory Data Analysis, Data Cleaning and Manipulation, and to implement some form of Machine Learning in R.’

The data can be found here

Exploratory Data Analysis Highlights

In order to carry out this task, we first take a closer look at the available data to determine which transformations should take place before we proceed to the data processing stage and build the predictive model.

Including Plots: Date-dependent

We have designed several plots to understand the data and include the right variables in our prediction. First, we took a closer look at the stations’ patterns over time:

Here is an overall visualization:

Here we visualize by month, averaging across the stations:
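The monthly averaging above can be sketched as follows. This is a minimal, self-contained illustration: the toy `solar_data` table, the `Date` format (YYYYMMDD), and the station column names `ACME` and `BEAV` are assumptions for the example, not the real competition data.

```r
# Sketch: average every station's readings within each month (toy data)
library(data.table)

# Toy stand-in for the competition data: Date plus one column per station
solar_data <- data.table(
  Date = c(19940101, 19940115, 19940201, 19940215),
  ACME = c(12, 14, 20, 22),
  BEAV = c(11, 13, 19, 21)
)

# Extract the month from the YYYYMMDD date and average each station within it
solar_data[, Month := substr(as.character(Date), 5, 6)]
monthly_avg <- solar_data[, lapply(.SD, mean), by = Month,
                          .SDcols = c("ACME", "BEAV")]
print(monthly_avg)
```

The same pattern scales to any number of station columns by widening `.SDcols`.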

Including Plots: Station-coordinate dependent

We merged the additional station information file with the summary of each station as a data table, to produce a comprehensible comparison between each station’s average across all dates, its altitude, and its coordinates.

Here we plot each station at its location on a map, with a color code corresponding to its altitude:

## Loading required package: ggplot2
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
## Using zoom = 7...
## Source : http://tile.stamen.com/terrain/7/... (12 map tiles downloaded)

Here we plot each station at its location on a map, with a color code corresponding to its average over time:

## Using zoom = 7...

Plot conclusion

Correlation among stations

Following the analysis of the stations’ observations relative to their locations, we thought we could establish some segmentation:
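One way to sketch such a segmentation is to compute the correlation between station time series and group stations whose series move together. Everything below is illustrative: the three station columns are synthetic, with `CLOU` built to track `ACME` closely.

```r
# Correlation between station time series (synthetic data, three stations)
set.seed(1)
readings <- data.frame(
  ACME = rnorm(100),
  BEAV = rnorm(100)
)
readings$CLOU <- readings$ACME + rnorm(100, sd = 0.1)  # tracks ACME closely

cor_mat <- cor(readings)
print(round(cor_mat, 2))

# Stations that move together can be grouped into one segment, e.g. with
# hierarchical clustering on 1 - correlation used as a distance
clusters <- hclust(as.dist(1 - cor_mat))
print(cutree(clusters, k = 2))
```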

More on stations: distributions

EDA & DATA PREPARATION

In order to obtain the results displayed above, we needed to prepare and clean the data:
  • Finding the NA ratio per column, to see its potential influence on the analysis
  • Finding and mapping the extreme outliers
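The NA-ratio step can be done in one line with `colMeans` over the missingness mask; the toy data frame below is just for illustration.

```r
# NA ratio per column: the share of missing values in each variable
df <- data.frame(a = c(1, NA, 3, NA), b = c(1, 2, 3, 4))
na_ratio <- colMeans(is.na(df))
print(na_ratio)
```

Columns with a high ratio can then be flagged for special handling before modelling.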

Outliers

Finding outliers
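One common way to flag extreme outliers is Tukey’s fences, marking values outside Q1 − 3·IQR and Q3 + 3·IQR; the 3×IQR threshold and the toy vector here are assumptions for the sketch, not necessarily the exact rule used in the report.

```r
# Flag extreme outliers with Tukey's fences (3 * IQR beyond the quartiles)
find_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x < q[1] - 3 * iqr | x > q[2] + 3 * iqr
}

v <- c(10, 11, 12, 11, 10, 500)  # 500 is an obvious extreme value
print(which(find_outliers(v)))
```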

DATA PROCESSING

  • Once we had our outliers mapped, we replaced them with NA values in order to reduce the potential bias
  • Having tackled the outliers, it was time to split the data and start training our model
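The replacement step above amounts to masking the flagged values; a minimal sketch (the threshold-based mask here stands in for whatever outlier rule was applied):

```r
# Replace flagged outliers with NA so they do not bias the model
v <- c(10, 11, 12, 11, 10, 500)
is_out <- v > 100            # stand-in for the outlier mask computed earlier
v[is_out] <- NA
print(v)
```

Downstream modelling code can then treat these values like any other missing data.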

Splitting into training and testing sets.

We will be training our model with 70% of the original data, then validating and testing with 15% each.
set.seed(42)  # fix the seed so the split is reproducible
train_index <- sample(1:nrow(solar_data), floor(0.7 * nrow(solar_data)))
val_index <- sample(setdiff(1:nrow(solar_data), train_index), floor(0.15 * nrow(solar_data)))
test_index <- setdiff(1:nrow(solar_data), c(train_index, val_index))
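The index vectors can then be used to materialise the three sets. A self-contained sketch on toy data (`solar_data` here is a 100-row stand-in, not the real dataset):

```r
set.seed(42)
# Toy stand-in for solar_data
solar_data <- data.frame(x = rnorm(100), y = rnorm(100))

# 70% / 15% / 15% split by row index
train_index <- sample(1:nrow(solar_data), floor(0.7 * nrow(solar_data)))
val_index   <- sample(setdiff(1:nrow(solar_data), train_index),
                      floor(0.15 * nrow(solar_data)))
test_index  <- setdiff(1:nrow(solar_data), c(train_index, val_index))

# Materialise the three sets from the index vectors
train <- solar_data[train_index, ]
val   <- solar_data[val_index, ]
test  <- solar_data[test_index, ]
```

Every row lands in exactly one of the three sets, since the validation and test indices are drawn from what the previous draws left over.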

BUILDING THE MODELS

While looking for our top-performing model, we explored a variety of algorithms that could be used for this purpose, including xgboost and SVM. After testing multiple models, we came to the conclusion that X model [TBA based on Max’s results] was best suited to the purposes of this assignment, so we are going to showcase it below.
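Since the winning model is still to be announced, here is a generic evaluation harness of the kind used to compare candidates on the validation split, with base-R `lm()` standing in for any candidate; the toy data and the choice of RMSE as the comparison score are assumptions of this sketch.

```r
# Generic candidate-model evaluation on the validation split (toy data)
set.seed(42)
solar_data <- data.frame(x = rnorm(100))
solar_data$y <- 2 * solar_data$x + rnorm(100, sd = 0.1)

train_index <- sample(1:nrow(solar_data), 70)
val_index   <- setdiff(1:nrow(solar_data), train_index)[1:15]

# lm() is a placeholder: any candidate (xgboost, SVM, ...) slots in here
fit  <- lm(y ~ x, data = solar_data[train_index, ])
pred <- predict(fit, newdata = solar_data[val_index, ])

# Validation RMSE: the score used to rank candidate models
rmse <- sqrt(mean((pred - solar_data$y[val_index])^2))
print(rmse)
```

Swapping the `fit`/`pred` pair for another learner leaves the rest of the harness unchanged, which makes the model comparison fair.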